LS4003 R tutorial 4
Chi-squared test in R
In this tutorial we’re going to use R to calculate chi-squared and Fisher’s exact test results for categorical data.
Make sure you’ve completed the Tutorial 1 section on using R from Excel before starting here.
Install and Set-up
A refresher for how to install and set up R and RStudio.
To get set up, follow the steps below for either Posit Cloud or your own machine.
Posit Cloud is an online, cloud-based option. It’s a bit more limited than running on a university computer or your own computer, but the free option should be enough for this module.
Go to Posit Cloud and create a free account
Log in, then go to New Project -> New RStudio Project.
Make a new folder in the bottom right panel (by clicking the New Folder button) called “LS4003_Statistics”.
Click on this folder to enter it, and then click the More cog (bottom right panel) and select “Set as Working Directory”.
To run R on your own machine, you have to install R (the programming language) and RStudio (the development environment).
When installing, choose the most appropriate option for your machine (Windows/Mac/Linux).
Once you have installed both, open RStudio.
Navigate to your Documents folder in the bottom right panel. (If you can’t find it, type setwd("~/Documents") into the console on the bottom left, then click the More cog on the bottom right and select “Go to Working Directory”.)
Create a new folder called LS4003_Statistics by clicking the New Folder button on the right hand side.
Click on your folder (LS4003_Statistics) to enter it.
Set that as your final working directory by clicking on the ‘More’ cog icon again and selecting “Set as Working Directory”.
Dataset - Portuguese secondary school students
For this tutorial we need the “student-mat.csv” file from the Canvas page. This is from a study looking at factors that may affect maths students at a school in Portugal, and it includes lots of categorical data.
This dataset contains the following information about each student:
| Column | Data |
|---|---|
| school | student’s school (‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira) |
| sex | student’s sex (binary: ‘F’ - female or ‘M’ - male) |
| age | student’s age (numeric: from 15 to 22) |
| address | student’s home address type (‘U’ - urban or ‘R’ - rural) |
| famsize | family size ( ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3) |
| Pstatus | parent’s cohabitation status (‘T’ - living together or ‘A’ - apart) |
| Medu | mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
| Fedu | father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
| Mjob | mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) |
| Fjob | father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) |
| reason | reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’) |
| guardian | student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’) |
| traveltime | home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) |
| studytime | weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) |
| failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
| schoolsup | extra educational support (binary: yes or no) |
| famsup | family educational support (binary: yes or no) |
| paid | extra paid classes within the course subject (binary: yes or no) |
| activities | extra-curricular activities (binary: yes or no) |
| nursery | attended nursery school (binary: yes or no) |
| higher | wants to take higher education (binary: yes or no) |
| internet | Internet access at home (binary: yes or no) |
| romantic | with a romantic relationship (binary: yes or no) |
| famrel | quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
| freetime | free time after school (numeric: from 1 - very low to 5 - very high) |
| goout | going out with friends (numeric: from 1 - very low to 5 - very high) |
| Dalc | workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| Walc | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| health | current health status (numeric: from 1 - very bad to 5 - very good) |
| absences | number of school absences (numeric: from 0 to 93) |
| G1 | first period grade (numeric: from 0 to 20) |
| G2 | second period grade (numeric: from 0 to 20) |
| G3 | final grade (numeric: from 0 to 20, output target) |
Read the data into R and create contingency tables
First we need to read this information and store it in an R dataframe.
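A minimal sketch of the read step, assuming the file has been saved in the working directory and is comma-separated. The data frame name students is our own choice for illustration. Some copies of this dataset use semicolons as separators, in which case sep = ";" would be needed.

```r
# Assumption: student-mat.csv has been saved in the current working directory.
csv_path <- "student-mat.csv"

if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE stores text columns (sex, school, ...) as factors
  students <- read.csv(csv_path, stringsAsFactors = TRUE)
  # If everything lands in a single column, the file is probably
  # semicolon-separated: try read.csv(csv_path, sep = ";") instead.
  str(students)          # column names and types
  print(head(students))  # first six rows
} else {
  message("Could not find ", csv_path, " in ", getwd())
}
```

If the message appears, check the working directory with getwd() and make sure the file is in that folder.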
For our chi-squared and Fisher tests, however, we need to generate a contingency table: we choose two categorical variables and count the occurrences of each combination. Luckily R will do this for us, using the table() function.
Let’s first look at whether there’s an association between sex and whether the student does after-school activities.
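A small sketch of how table() builds a contingency table. The 12-row data frame here is made up so that the example runs on its own; with the real data you would pass the sex and activities columns of the data frame read from student-mat.csv. The names students and activity_table are assumptions for illustration.

```r
# Made-up stand-in for the real data (12 students, illustration only).
students <- data.frame(
  sex        = c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M"),
  activities = c("no", "no", "no", "no", "yes", "yes",
                 "no", "no", "yes", "yes", "yes", "yes")
)

# The first argument becomes the rows, the second becomes the columns
activity_table <- table(sex = students$sex, activities = students$activities)
print(activity_table)

# addmargins() appends row and column totals, useful for checking the counts
print(addmargins(activity_table))
```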
Odds Ratio
We can calculate the odds ratio for our comparison using the oddsratio() function from the epitools package.
You’ll first need to install the package with install.packages('epitools').
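A sketch of the odds ratio calculation, using made-up counts so it runs without the data file (so the numbers printed will differ from those quoted in this tutorial). The hand calculation is included as a check. By default oddsratio() reports a median-unbiased ("midp") estimate, so its value may differ slightly from the simple hand calculation.

```r
# install.packages("epitools")  # run once if the package is missing

# Made-up counts for illustration only (not taken from student-mat.csv).
# With the real data: activity_table <- table(students$sex, students$activities)
activity_table <- matrix(
  c(40, 30,
    30, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("F", "M"), activities = c("no", "yes"))
)

# Hand calculation: odds of "yes" among males divided by odds among females
odds_male   <- activity_table["M", "yes"] / activity_table["M", "no"]
odds_female <- activity_table["F", "yes"] / activity_table["F", "no"]
manual_or   <- odds_male / odds_female
print(manual_or)

if (requireNamespace("epitools", quietly = TRUE)) {
  # The first row (F) is the reference group; the second column is the outcome
  or_result <- epitools::oddsratio(activity_table)
  print(or_result$measure)  # estimate with lower and upper 95% limits
}
```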
This result shows that the odds ratio for male students doing an activity (compared to females) is 1.491, so their odds of doing an activity are around 1.5 times those of female students.
Also in this table are the “lower” and “upper” values, which are the boundaries of the 95% confidence interval. As both values are above one, this indicates increased odds (of males doing an activity compared to females).
We can do the same calculation the other way around, looking at the odds of females doing an activity compared to males.
The options you can give for the rev (reverse) parameter are “neither”, “rows”, “columns” and “both”.
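A sketch of the reversed comparison, again on made-up counts. Setting rev = "rows" swaps the row order so that males become the reference group. The hand-calculated value is simply the reciprocal of the male-versus-female odds ratio.

```r
# Made-up counts for illustration only (not taken from student-mat.csv)
activity_table <- matrix(
  c(40, 30,
    30, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("F", "M"), activities = c("no", "yes"))
)

# Hand calculation: odds of "yes" among females divided by odds among males
odds_female <- activity_table["F", "yes"] / activity_table["F", "no"]
odds_male   <- activity_table["M", "yes"] / activity_table["M", "no"]
manual_or_f_vs_m <- odds_female / odds_male
print(manual_or_f_vs_m)

if (requireNamespace("epitools", quietly = TRUE)) {
  or_reversed <- epitools::oddsratio(activity_table, rev = "rows")
  print(or_reversed$data)     # the table as reordered by rev = "rows"
  print(or_reversed$measure)  # females compared to males
}
```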
Risk ratio
Risk ratio is very similar to odds ratio (refer back to Statistics Lecture 4 for an explanation of the differences).
We can also calculate risk ratio in R.
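A sketch using riskratio() from the same epitools package, on made-up counts. The hand calculation divides the proportion of males doing an activity by the proportion of females doing one, which should agree with the default estimate that riskratio() reports.

```r
# Made-up counts for illustration only (not taken from student-mat.csv)
activity_table <- matrix(
  c(40, 30,
    30, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("F", "M"), activities = c("no", "yes"))
)

# Hand calculation: risk = proportion of each sex with activities == "yes"
risk_male   <- activity_table["M", "yes"] / sum(activity_table["M", ])
risk_female <- activity_table["F", "yes"] / sum(activity_table["F", ])
manual_rr   <- risk_male / risk_female
print(manual_rr)

if (requireNamespace("epitools", quietly = TRUE)) {
  rr_result <- epitools::riskratio(activity_table)
  print(rr_result$measure)  # estimate with lower and upper 95% limits
}
```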
This shows that the risk ratio for males doing an activity (compared to females) is 1.217.
Chi-squared test
As our grand total of observed values is more than 50 and every cell count exceeds 5, we can use the chi-squared test.
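A sketch of the test with base R's chisq.test(), on made-up counts. The usual rule of thumb is stated in terms of expected counts, which the result exposes as $expected, so the example prints those too. Note that for a 2x2 table chisq.test() applies Yates' continuity correction unless correct = FALSE is given.

```r
# Made-up counts for illustration only (not taken from student-mat.csv)
activity_table <- matrix(
  c(40, 30,
    30, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("F", "M"), activities = c("no", "yes"))
)

chi_result <- chisq.test(activity_table)
print(chi_result)           # statistic, degrees of freedom and p-value
print(chi_result$expected)  # expected counts under independence

# Same test without the continuity correction, for comparison
chi_uncorrected <- chisq.test(activity_table, correct = FALSE)
print(chi_uncorrected$p.value)
```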
The p-value is 0.0597, which is not less than 0.05, so the result is not statistically significant at the 5% level.
Fisher test
We can also do a Fisher test if we’re looking at categories with small counts.
If we compare sex and whether or not the student wants to go to higher education, you’ll see we have very low numbers of students that don’t want to go to higher education in this class.
We can then use the Fisher test:
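A sketch with base R's fisher.test(), using a made-up table in which few students answer "no". With the real data the table would come from the sex and higher columns. The name higher_table is an assumption for illustration.

```r
# Made-up counts for illustration only (not taken from student-mat.csv).
# With the real data: higher_table <- table(students$sex, students$higher)
higher_table <- matrix(
  c(2, 40,
    7, 38),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("F", "M"), higher = c("no", "yes"))
)
print(higher_table)

# Expected counts under independence: row total * column total / grand total
expected_counts <- outer(rowSums(higher_table), colSums(higher_table)) /
  sum(higher_table)
print(expected_counts)  # a value below 5 suggests using the Fisher test

fisher_result <- fisher.test(higher_table)
print(fisher_result)    # p-value, odds ratio estimate and confidence interval
```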
Chi-squared with multiple categories
Finally, let’s look at an example with multiple categories in one of the variables.
Perhaps there is a difference between the reasons students chose the school and which school they attend.
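A sketch of the contingency table for this comparison. The 20-row data frame is made up so that the example runs on its own; with the real data the reason and school columns would come from the data frame read from student-mat.csv.

```r
# Made-up stand-in for the real data (20 students, illustration only)
students <- data.frame(
  school = rep(c("GP", "MS"), times = c(12, 8)),
  reason = c(
    rep(c("course", "home", "other", "reputation"), times = c(4, 3, 1, 4)),
    rep(c("course", "home", "other", "reputation"), times = c(3, 2, 2, 1))
  )
)

# Four reasons (rows) by two schools (columns)
reason_table <- table(reason = students$reason, school = students$school)
print(reason_table)
```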
We can then run the chi-squared test exactly the same as before:
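A sketch of the same call on a 4x2 table of made-up counts, so the p-value printed here will not match the one quoted in this tutorial. With more than two categories the degrees of freedom are (rows - 1) x (columns - 1), and no continuity correction is applied.

```r
# Made-up counts for illustration only (not taken from student-mat.csv).
# With the real data: reason_table <- table(students$reason, students$school)
reason_table <- matrix(
  c(60, 20,
    50, 10,
    15, 15,
    55,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(reason = c("course", "home", "other", "reputation"),
                  school = c("GP", "MS"))
)

chi_reason <- chisq.test(reason_table)
print(chi_reason)           # df = (4 - 1) * (2 - 1) = 3
print(chi_reason$expected)  # check the expected counts are large enough
```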
The p-value is 0.00595 (is this less than 0.05?)
We can visualise which factors had the biggest contribution to the result by doing the following calculation:
\(100 \times \frac{\text{residuals}^2}{\text{statistic}}\)
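A sketch of this calculation on made-up counts. The $residuals element returned by chisq.test() holds the Pearson residuals, (observed - expected) / sqrt(expected), and $statistic is the chi-squared value, so the percentages add up to 100. The name contrib is an assumption for illustration.

```r
# Made-up counts for illustration only (not taken from student-mat.csv)
reason_table <- matrix(
  c(60, 20,
    50, 10,
    15, 15,
    55,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(reason = c("course", "home", "other", "reputation"),
                  school = c("GP", "MS"))
)

chi_reason <- chisq.test(reason_table)

# Percentage contribution of each cell to the chi-squared statistic
contrib <- 100 * chi_reason$residuals^2 / chi_reason$statistic
print(round(contrib, 2))
print(sum(contrib))  # the contributions sum to 100
```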
We can then visualise this with corrplot.
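A sketch of the plot, assuming the corrplot package is installed and using made-up counts. The argument is.corr = FALSE tells corrplot() that the matrix holds general values (here percentages) rather than correlations between -1 and 1.

```r
# install.packages("corrplot")  # run once if the package is missing

# Made-up counts for illustration only (not taken from student-mat.csv)
reason_table <- matrix(
  c(60, 20,
    50, 10,
    15, 15,
    55,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(reason = c("course", "home", "other", "reputation"),
                  school = c("GP", "MS"))
)

chi_reason <- chisq.test(reason_table)
contrib <- 100 * chi_reason$residuals^2 / chi_reason$statistic

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Larger, darker circles mark cells that contribute more to the statistic
  corrplot::corrplot(contrib, is.corr = FALSE)
}
```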
Testing a subset of the values
The strongest contributions appear to be in the “reputation” and “other” categories.
To see if these are statistically significant, we can create a new data frame with just these values so we can then compare them.
Once we have our contingency table, we can then run our chi-squared test:
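A sketch of the subsetting and the follow-up test, on a made-up data frame. Rows are kept with %in%, and droplevels() removes the now-empty "course" and "home" categories so they do not appear as rows of zeros in the table. The names subset_students and subset_table are assumptions for illustration.

```r
# Made-up stand-in for the real data (illustration only)
students <- data.frame(
  school = factor(rep(c("GP", "MS"), times = c(180, 50))),
  reason = factor(c(
    rep(c("course", "home", "other", "reputation"), times = c(60, 50, 15, 55)),
    rep(c("course", "home", "other", "reputation"), times = c(20, 10, 15, 5))
  ))
)

# Keep only the two reasons of interest and drop the unused factor levels
keep_rows <- students$reason %in% c("reputation", "other")
subset_students <- droplevels(students[keep_rows, ])

subset_table <- table(reason = subset_students$reason,
                      school = subset_students$school)
print(subset_table)

# A 2x2 table again, so the test has one degree of freedom
subset_result <- chisq.test(subset_table)
print(subset_result)
```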
Extension
This was a very large dataset and we only looked at a few columns.
What else can you find out?
Not all of the columns are categorical data. If you want to compare any other two variables (e.g. whether they went to nursery and their current grades), which test would you use?
We’ve covered all the main statistical tests now, so try to apply what you’ve learnt from every tutorial to this dataset.